Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document textline, as an intrinsic granularity in VDU, has seldom been explored so far. A document textline, which can be easily obtained from OCR engines, usually contains words that are spatially and semantically correlated. In this paper, we propose Wukong-Reader, trained with new pre-training objectives to leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that our Wukong-Reader has superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also empowers Wukong-Reader with promising localization ability.
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Machine learning has been widely used in healthcare applications to approximate complex models for clinical diagnosis, prognosis, and treatment. Although deep learning has an outstanding ability to extract information from time series, its true capabilities on sparse, irregularly sampled, multivariate, and imbalanced physiological data are not yet fully explored. In this paper, we systematically examine the performance of machine learning models on clinical prediction tasks based on EHR data, especially physiological time series. We use the public PhysioNet 2019 Challenge dataset to predict sepsis outcomes in ICUs. Ten baseline machine learning models commonly used in the clinical prediction domain are compared, including 3 deep learning methods and 7 non-deep learning methods. Nine evaluation metrics with specific clinical implications are used to assess model performance. In addition, we sub-sample the training dataset and fit learning curves to investigate the impact of training dataset size on model performance. We also propose a general pre-processing method for physiological time-series data and use Dice loss to deal with dataset imbalance. The results show that deep learning indeed outperforms non-deep learning, but only under certain conditions: first, when evaluated with particular metrics (AUROC, AUPRC, sensitivity, and FNR) but not others; second, when the training dataset is large enough (estimated to be on the order of thousands).
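As a rough illustration of the Dice loss mentioned in the abstract above, the following is a minimal sketch of a soft Dice loss for binary labels. The function name `soft_dice_loss`, the `smooth` term, and the toy inputs are assumptions made for illustration; the abstract does not specify the exact formulation the authors used.

```python
def soft_dice_loss(probs, labels, smooth=1.0):
    """Soft Dice loss for binary labels (illustrative sketch, not the authors' code).

    probs  -- predicted probabilities of the positive class, each in [0, 1]
    labels -- ground-truth labels, each 0 or 1
    smooth -- assumed smoothing constant that avoids division by zero
    """
    # Overlap between predictions and positive labels.
    intersection = sum(p * y for p, y in zip(probs, labels))
    # Total predicted mass plus total positive mass.
    total = sum(probs) + sum(labels)
    # Dice coefficient lies in (0, 1]; the loss is one minus the coefficient.
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)


# Toy imbalanced batch: one positive among four samples.
labels = [1, 0, 0, 0]
loss_good = soft_dice_loss([0.9, 0.1, 0.1, 0.1], labels)
loss_all_negative = soft_dice_loss([0.0, 0.0, 0.0, 0.0], labels)
```

Because the loss depends on the overlap with the positive class rather than on per-sample accuracy, predicting every sample as negative is penalized even when positives are rare, which is why Dice-style losses are often considered for imbalanced data.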
The spread of misinformation is a prominent problem in today's society, and many researchers in academia and industry are trying to combat it. Due to the vast amount of misinformation that is created every day, it is unrealistic to leave this task to human fact-checkers. Data scientists and researchers have been working on automated misinformation detection for years, and it is still a challenging problem today. The goal of our research is to add a new level to automated misinformation detection: classifying segments of text by persuasive writing technique in order to produce interpretable reasoning for why an article can be marked as misinformation. To accomplish this, we present a novel annotation scheme containing many common persuasive writing tactics, along with a dataset carrying the corresponding human annotations. For this task, we make use of a RoBERTa model for text classification, due to its high performance in NLP. We develop several language model-based baselines and present the results of our persuasive strategy label predictions as well as the improvements these intermediate labels make in detecting misinformation and producing interpretable results.
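Complementary-label learning (CLL) is a common application in weakly supervised settings. In real-world datasets, however, CLL encounters class-imbalanced training samples, where the number of samples in one class is significantly lower than in the other classes. Unfortunately, existing CLL methods have not yet explored the problem of class-imbalanced samples, which reduces prediction accuracy, especially on the imbalanced classes. In this paper, we propose a novel problem setting that allows learning from class-imbalanced complementarily labeled samples for multi-class classification. To address this new problem, we propose a new CLL method called Weighted Complementary-Label Learning (WCLL). The proposed method models a weighted empirical risk loss by exploiting the class-imbalanced complementarily labeled information, and it is also applicable to multi-class imbalanced training samples. In addition, an estimation error bound for the proposed method is derived to provide a theoretical guarantee. Finally, we conduct extensive experiments on widely used benchmark datasets to validate the superiority of our method by comparing it with existing state-of-the-art methods.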
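Skeleton-based human action recognition has recently attracted growing interest because of its low sensitivity to appearance changes and the increasing accessibility of skeleton data. However, even the 3D skeletons captured in practice remain sensitive to viewpoint and direction, given the occlusion of different human-body joints and the errors in human joint localization. Such view variance of skeleton data may significantly affect the performance of action recognition. To address this issue, we propose in this paper a new view-invariant representation learning approach, requiring no manual action labels, for skeleton-based human action recognition. Specifically, we leverage multi-view skeleton data captured simultaneously for the same person by maximizing the mutual information between the representations extracted from different views, and then propose a global-local contrastive loss to model the multi-scale co-occurrence relationships in both the spatial and temporal domains. Extensive experimental results show that the proposed method is robust to the view difference of the input skeleton data and significantly boosts the performance of unsupervised skeleton-based human action methods, yielding new state-of-the-art accuracies on two challenging multi-view benchmarks, PKUMMD and NTU RGB+D.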
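The idea of maximizing mutual information between representations of the same action seen from different views is often approximated in practice with an InfoNCE-style contrastive loss. The sketch below assumes that estimator, which the abstract does not confirm; the names `cosine` and `info_nce`, the `temperature` value, and the toy embeddings are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss where view_a[i] and view_b[i] embed the same sample.

    Illustrative sketch only: the matching pair (i, i) is the positive,
    and every other embedding in view_b acts as a negative.
    """
    losses = []
    for i, a in enumerate(view_a):
        logits = [cosine(a, b) / temperature for b in view_b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true cross-view partner.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two samples, each embedded from two camera views (toy 2-D embeddings).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0]]
swapped_b = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(view_a, aligned_b)
loss_swapped = info_nce(view_a, swapped_b)
```

Minimizing this loss pulls together the two views of the same sample and pushes apart views of different samples, which is one way an encoder can be encouraged to produce view-invariant representations without action labels.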
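For years, the YOLO series has been the de facto industry-level standard for efficient object detection. The YOLO community has prospered overwhelmingly, enriching its use across a multitude of hardware platforms and abundant scenarios. In this technical report, we strive to push its limits to the next level, stepping forward with an unwavering mindset for industry application. Considering the diverse requirements for speed and accuracy in real environments, we extensively examine the latest object detection advancements from both industry and academia. Specifically, we heavily assimilate ideas from recent network design, training strategies, testing techniques, quantization, and optimization methods. On top of this, we integrate our thoughts and practice to build a suite of deployment-ready networks at various scales to accommodate diversified use cases. With the generous permission of the YOLO authors, we name it YOLOv6. We also warmly welcome users and contributors for further enhancement. For a glimpse of performance, our YOLOv6-N hits 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOX-S, and PPYOLOE-S). Our quantized version of YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS. Furthermore, YOLOv6-M/L achieves better accuracy (i.e., 49.5%/52.3%) than other detectors with a similar inference speed. We carefully conducted experiments to validate the effectiveness of each component. Our code is available at https://github.com/meituan/yolov6.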
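Inspired by human behavior, this study proposes a probing strategy and integrates it into a traversability analysis framework to address safe navigation on unknown rough terrain. Our framework integrates collapsibility information into our existing traversability analysis, because vision and geometric information alone can be misled by unpredictable non-rigid terrain such as soft soil, bushes, or water puddles. With the new traversability analysis framework, our robot obtains a more comprehensive assessment of unpredictable terrain, which is critical for its safety in outdoor environments. The pipeline first uses an RGB-D camera to identify the terrain's geometric and semantic properties and to select probing locations on questionable terrain. These regions are then probed with a force sensor to determine the risk of the terrain collapsing when the robot steps on it. This risk is formulated as a collapsibility metric, which estimates the ground collapsibility of an unpredictable region. Thereafter, the collapsibility metric is combined with the geometric and semantic spatial data and analyzed to produce global and local traversability grid maps. These traversability grid maps tell the robot whether it is safe to step across different regions of the map. The grid maps are then used to generate optimal paths for the robot to navigate safely to its goal. Our approach has been successfully validated on a quadrupedal robot in both simulation and real-world experiments.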
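The Adaptive Moment Estimation (Adam) optimizer is widely used in deep learning tasks because of its fast convergence properties. However, the convergence of Adam is still not well understood. In particular, existing analyses of Adam cannot clearly demonstrate its advantage over SGD. We attribute this theoretical embarrassment to the $L$-smooth condition (i.e., the assumption that the gradient is globally Lipschitz continuous with constant $L$) adopted in the literature, which has often been pointed out to fail in practical neural networks. To tackle this embarrassment, we analyze the convergence of Adam under a relaxed condition called the $(L_0,L_1)$ smoothness condition, which allows the gradient Lipschitz constant to change with the local gradient norm. The $(L_0,L_1)$ condition is strictly weaker than the $L$-smooth condition, and it has been empirically verified to hold for practical deep neural networks. Under the $(L_0,L_1)$ smoothness condition, we establish convergence for Adam with practical hyperparameters. Specifically, we argue that Adam can adapt to the local smoothness condition, which justifies the adaptivity of Adam. In contrast, SGD can be arbitrarily slow under this condition. Our result might shed light on the benefit of adaptive gradient methods over non-adaptive ones.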
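For reference, the two smoothness conditions contrasted in the abstract above can be written as follows. The second-order form of the $(L_0,L_1)$ condition shown here is the one commonly stated for twice-differentiable objectives $f$ and is an assumption on our part; the paper itself may use a first-order variant.

```latex
% L-smoothness: the gradient is globally Lipschitz with constant L.
\[
  \|\nabla f(x) - \nabla f(y)\| \le L \,\|x - y\| \quad \text{for all } x, y .
\]
% (L_0, L_1)-smoothness (second-order form, assumed here for illustration):
% the local curvature bound may grow with the gradient norm.
\[
  \|\nabla^2 f(x)\| \le L_0 + L_1 \,\|\nabla f(x)\| \quad \text{for all } x .
\]
```

Setting $L_1 = 0$ gives a uniform bound $\|\nabla^2 f(x)\| \le L_0$, which for twice-differentiable $f$ is exactly $L$-smoothness with $L = L_0$; this is the sense in which the $(L_0,L_1)$ condition relaxes the usual assumption.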
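Cross-speaker style transfer aims to extract the speech style of a given reference speech so that it can be reproduced in the timbre of arbitrary target speakers. Existing methods on this topic have explored using utterance-level style labels to perform style transfer via global-scale or local-scale style representations. However, audiobook datasets are typically characterized by both local prosody and global genre, and are rarely accompanied by utterance-level style labels. Properly transferring a reading style across different speakers therefore remains a challenging task. This paper introduces a chunk-wise multi-scale cross-speaker style model to capture both the global genre and the local prosody of audiobook speech. Moreover, by disentangling speaker timbre and style with the proposed switchable adversarial classifiers, the extracted reading style is made adaptable to the timbre of different speakers. Experimental results confirm that the model manages to transfer a given reading style to new target speakers. With the support of local prosody and global genre type predictors, the potential of the proposed method for multi-speaker audiobook generation is further revealed.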